Fundamentals of Data Analysis - Spambase
Rosario Scavo (1000037803)
The dataset can be downloaded from here: http://archive.ics.uci.edu/dataset/94/spambase
Dataset description
The dataset includes various types of content that fall under the category of "spam", such as advertisements, chain letters, make-money-fast schemes, and pornography. The spam emails were collected from the postmaster and from individuals who reported spam, while the non-spam emails were collected from personal and work files; in the latter, the presence of the word 'george' and the area code '650' serve as indicators of non-spam.
The central goal is to establish a classification rule that identifies spam messages based on the frequency of specific words, numbers, characters, or runs of consecutive capital letters. We will employ several classification algorithms, including logistic regression (LR), Support Vector Machines (SVM), decision trees, random forests, and K-nearest neighbors (KNN). These algorithms will be optimized through appropriate data preparation, transformation, and hyperparameter tuning using built-in Python functions. Additionally, we will determine the appropriate metrics to maximize and assess their impact on classification performance.
However, effective implementation requires thorough data analysis. Without prior data understanding, employing classifiers becomes challenging, if not impossible. This analysis will involve attribute exploration, variable type verification, missing value identification, feature-level metric analysis (mean, standard deviation, quantiles, etc.), feature importance determination for spam/non-spam classification, and outlier detection and analysis.
# imports
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from statsmodels.formula.api import logit
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
import graphviz
from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning
warnings.simplefilter('ignore', ConvergenceWarning)
warnings.simplefilter('ignore', RuntimeWarning)
names_list_filepath = 'spambase/names.txt'
attribute_names = []
with open(names_list_filepath, 'r') as file:
    attribute_names = file.read().splitlines()
data = pd.read_csv('spambase/spambase.data', names=attribute_names)
data
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | Class | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00 | 0.64 | 0.64 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.778 | 0.000 | 0.000 | 3.756 | 61 | 278 | 1 |
| 1 | 0.21 | 0.28 | 0.50 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.00 | 0.94 | ... | 0.000 | 0.132 | 0.0 | 0.372 | 0.180 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 2 | 0.06 | 0.00 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.010 | 0.143 | 0.0 | 0.276 | 0.184 | 0.010 | 9.821 | 485 | 2259 | 1 |
| 3 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.137 | 0.0 | 0.137 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| 4 | 0.00 | 0.00 | 0.00 | 0.0 | 0.63 | 0.00 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.000 | 0.135 | 0.0 | 0.135 | 0.000 | 0.000 | 3.537 | 40 | 191 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4596 | 0.31 | 0.00 | 0.62 | 0.0 | 0.00 | 0.31 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.232 | 0.0 | 0.000 | 0.000 | 0.000 | 1.142 | 3 | 88 | 0 |
| 4597 | 0.00 | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.353 | 0.000 | 0.000 | 1.555 | 4 | 14 | 0 |
| 4598 | 0.30 | 0.00 | 0.30 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.102 | 0.718 | 0.0 | 0.000 | 0.000 | 0.000 | 1.404 | 6 | 118 | 0 |
| 4599 | 0.96 | 0.00 | 0.00 | 0.0 | 0.32 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.057 | 0.0 | 0.000 | 0.000 | 0.000 | 1.147 | 5 | 78 | 0 |
| 4600 | 0.00 | 0.00 | 0.65 | 0.0 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.000 | 0.0 | 0.125 | 0.000 | 0.000 | 1.250 | 5 | 40 | 0 |
4601 rows × 58 columns
Attribute description
- The last column of 'spambase.data' (Class) indicates whether the email was considered spam (1) or not (0), i.e., unsolicited commercial email.
- Most attributes indicate whether a specific word or character frequently occurs in the email.
- Attributes 55-57 (run-length attributes) measure the length of sequences of consecutive capital letters.
Definitions of Attributes:
- 48 continuous real [0,100] attributes of type word_freq_WORD: percentage of words in the email that match WORD, i.e. $\frac{100 \times (\text{number of times WORD appears in the email})}{\text{total number of words in the email}}$.
- 6 continuous real [0,100] attributes of type char_freq_CHAR: percentage of characters in the email that match CHAR, i.e. $\frac{100 \times (\text{number of occurrences of CHAR})}{\text{total number of characters in the email}}$.
- 1 continuous real [1,...] attribute capital_run_length_average: average length of uninterrupted sequences of capital letters.
- 1 continuous integer [1,...] attribute capital_run_length_longest: length of the longest uninterrupted sequence of capital letters.
- 1 continuous integer [1,...] attribute capital_run_length_total: sum of the lengths of uninterrupted sequences of capital letters, i.e. the total number of capital letters in the email.
- 1 nominal {0,1} class attribute spam: denotes whether the email was considered spam (1) or not (0), i.e. unsolicited commercial email.
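As a concrete illustration of the definitions above, the sketch below shows how such features could be recomputed from a raw email string. This is a hypothetical re-implementation (the function name and tokenization rules are my own assumptions), not the original extraction code used to build Spambase.

```python
import re

def extract_features(text: str, word: str) -> dict:
    """Illustrative Spambase-style features for a single email string."""
    # Tokenize on alphabetic sequences; case-insensitive word match
    words = re.findall(r"[A-Za-z]+", text)
    word_freq = 100 * sum(w.lower() == word for w in words) / max(len(words), 1)
    # Uninterrupted runs of capital letters
    runs = [len(r) for r in re.findall(r"[A-Z]+", text)]
    return {
        f"word_freq_{word}": word_freq,
        "capital_run_length_average": sum(runs) / max(len(runs), 1),
        "capital_run_length_longest": max(runs, default=0),
        "capital_run_length_total": sum(runs),
    }

feats = extract_features("FREE money!!! Claim your FREE prize NOW", "free")
# 2 of 7 words match "free" -> word_freq_free = 200/7 ~ 28.57
# capital runs: FREE, C, FREE, NOW -> lengths [4, 1, 4, 3]
```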
data.keys()
Index(['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d',
'word_freq_our', 'word_freq_over', 'word_freq_remove',
'word_freq_internet', 'word_freq_order', 'word_freq_mail',
'word_freq_receive', 'word_freq_will', 'word_freq_people',
'word_freq_report', 'word_freq_addresses', 'word_freq_free',
'word_freq_business', 'word_freq_email', 'word_freq_you',
'word_freq_credit', 'word_freq_your', 'word_freq_font', 'word_freq_000',
'word_freq_money', 'word_freq_hp', 'word_freq_hpl', 'word_freq_george',
'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85',
'word_freq_technology', 'word_freq_1999', 'word_freq_parts',
'word_freq_pm', 'word_freq_direct', 'word_freq_cs', 'word_freq_meeting',
'word_freq_original', 'word_freq_project', 'word_freq_re',
'word_freq_edu', 'word_freq_table', 'word_freq_conference',
'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
'char_freq_$', 'char_freq_#', 'capital_run_length_average',
'capital_run_length_longest', 'capital_run_length_total', 'Class'],
dtype='object')
- Number of instances: 4601, of which 1813 are SPAM (39.4%)
- Number of attributes: 58 (57 continuous, 1 categorical representing the class label).
class_counts = data['Class'].value_counts()
print(class_counts)
print("\n")
data.info()
Class
0    2788
1    1813
Name: count, dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4601 entries, 0 to 4600
Data columns (total 58): 48 word_freq_* (float64), 6 char_freq_* (float64), capital_run_length_average (float64), capital_run_length_longest (int64), capital_run_length_total (int64), Class (int64) — all columns 4601 non-null.
dtypes: float64(55), int64(3)
memory usage: 2.0 MB
Dataset analysis
Dataset integrity
Before analyzing the data, let's verify that the 'Class' attribute only contains the values 1 and 0. Additionally, we will check for any NaN values in the dataset.
data['Class'].unique()
array([1, 0])
count_nan_in_df = data.isnull().sum().sum()
print(f'Number of NaN values: {count_nan_in_df}')
Number of NaN values: 0
For simplicity, we will change the class type to bool and rename it to 'spam.' Consequently, when a record has spam=True, it indicates that the email is spam.
data['spam'] = data['Class'].astype(bool)
data = data.drop(columns=['Class'])
data['spam']
0 True
1 True
2 True
3 True
4 True
...
4596 False
4597 False
4598 False
4599 False
4600 False
Name: spam, Length: 4601, dtype: bool
Using the min and max rows of the describe output, which report the minimum and maximum values for each column, we can confirm that the frequency attributes respect their stated ranges: the lower bound of 0 is respected everywhere, and no value exceeds the upper bound of 100, since the frequencies are ratios multiplied by 100 (percentages), as explained earlier.
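This range check can also be made explicit with a couple of assertions. The sketch below uses a toy frame with illustrative values standing in for `data`; on the real frame, the same two lines apply directly.

```python
import pandas as pd

# Toy stand-in for the Spambase frame (illustrative values)
df = pd.DataFrame({
    "word_freq_make": [0.00, 0.21, 4.54],
    "char_freq_!": [0.778, 0.000, 32.478],
    "capital_run_length_total": [278, 1028, 15841],
})

# Only the frequency attributes are bounded by [0, 100]
freq_cols = [c for c in df.columns if c.startswith(("word_freq", "char_freq"))]
in_range = df[freq_cols].ge(0).all().all() and df[freq_cols].le(100).all().all()
```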
Issue: Matrix Sparsity
However, a notable observation is that most quartile values are zero. This stems from the inherent sparsity of the matrix: in the majority of records, many frequency values are zero. Consequently, the data is concentrated near zero, introducing noise that could compromise the statistical analysis of the dataset.
To address this issue, at a later stage of the project values equal to 0.0 are replaced with NaN for the frequency attributes. This mitigates the impact of matrix sparsity and makes the dataset more suitable for robust statistical analysis.
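A minimal sketch of that zero-to-NaN replacement on a toy frame (illustrative values; in the notebook the same `replace` call would be applied to the frequency columns of `data`). Pandas then skips NaN when computing statistics, so the per-feature mean reflects only the emails where the word actually occurs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two frequency columns of `data`
df = pd.DataFrame({"word_freq_make": [0.00, 0.21, 0.00],
                   "word_freq_our":  [0.32, 0.00, 1.23]})

# Treat zero frequencies as "word absent" rather than a measured value
df_sparse = df.replace(0.0, np.nan)

# mean() ignores NaN, so word_freq_make now averages to 0.21 instead of 0.07
mean_present = df_sparse["word_freq_make"].mean()
```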
data.describe()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | ... | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 |
| mean | 0.104553 | 0.213015 | 0.280656 | 0.065425 | 0.312223 | 0.095901 | 0.114208 | 0.105295 | 0.090067 | 0.239413 | ... | 0.031869 | 0.038575 | 0.139030 | 0.016976 | 0.269071 | 0.075811 | 0.044238 | 5.191515 | 52.172789 | 283.289285 |
| std | 0.305358 | 1.290575 | 0.504143 | 1.395151 | 0.672513 | 0.273824 | 0.391441 | 0.401071 | 0.278616 | 0.644755 | ... | 0.285735 | 0.243471 | 0.270355 | 0.109394 | 0.815672 | 0.245882 | 0.429342 | 31.729449 | 194.891310 | 606.347851 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.588000 | 6.000000 | 35.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.065000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.276000 | 15.000000 | 95.000000 |
| 75% | 0.000000 | 0.000000 | 0.420000 | 0.000000 | 0.380000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.160000 | ... | 0.000000 | 0.000000 | 0.188000 | 0.000000 | 0.315000 | 0.052000 | 0.000000 | 3.706000 | 43.000000 | 266.000000 |
| max | 4.540000 | 14.280000 | 5.100000 | 42.810000 | 10.000000 | 5.880000 | 7.270000 | 11.110000 | 5.260000 | 18.180000 | ... | 10.000000 | 4.385000 | 9.752000 | 4.081000 | 32.478000 | 6.003000 | 19.829000 | 1102.500000 | 9989.000000 | 15841.000000 |
8 rows × 57 columns
data[data['spam'] == True].iloc[:, 0:-4].max()
word_freq_make 4.540 word_freq_address 4.760 word_freq_all 3.700 word_freq_3d 42.810 word_freq_our 7.690 word_freq_over 2.540 word_freq_remove 7.270 word_freq_internet 11.110 word_freq_order 3.330 word_freq_mail 7.550 word_freq_receive 2.610 word_freq_will 6.250 word_freq_people 5.550 word_freq_report 4.760 word_freq_addresses 4.410 word_freq_free 16.660 word_freq_business 7.140 word_freq_email 9.090 word_freq_you 12.500 word_freq_credit 18.180 word_freq_your 11.110 word_freq_font 17.100 word_freq_000 5.450 word_freq_money 12.500 word_freq_hp 3.580 word_freq_hpl 1.770 word_freq_george 1.280 word_freq_650 9.090 word_freq_lab 0.470 word_freq_labs 3.380 word_freq_telnet 1.360 word_freq_857 0.470 word_freq_data 2.120 word_freq_415 1.350 word_freq_85 1.910 word_freq_technology 1.620 word_freq_1999 5.050 word_freq_parts 1.560 word_freq_pm 1.880 word_freq_direct 2.220 word_freq_cs 0.100 word_freq_meeting 0.450 word_freq_original 0.890 word_freq_project 1.160 word_freq_re 5.550 word_freq_edu 2.730 word_freq_table 0.460 word_freq_conference 0.770 char_freq_; 1.117 char_freq_( 9.752 char_freq_[ 1.171 char_freq_! 7.843 char_freq_$ 6.003 char_freq_# 19.829 dtype: float64
data.iloc[:, :-4] /= 100
data.describe()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | ... | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 | 4601.000000 |
| mean | 0.001046 | 0.002130 | 0.002807 | 0.000654 | 0.003122 | 0.000959 | 0.001142 | 0.001053 | 0.000901 | 0.002394 | ... | 0.000319 | 0.000386 | 0.001390 | 0.000170 | 0.002691 | 0.000758 | 0.000442 | 5.191515 | 52.172789 | 283.289285 |
| std | 0.003054 | 0.012906 | 0.005041 | 0.013952 | 0.006725 | 0.002738 | 0.003914 | 0.004011 | 0.002786 | 0.006448 | ... | 0.002857 | 0.002435 | 0.002704 | 0.001094 | 0.008157 | 0.002459 | 0.004293 | 31.729449 | 194.891310 | 606.347851 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.588000 | 6.000000 | 35.000000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000650 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2.276000 | 15.000000 | 95.000000 |
| 75% | 0.000000 | 0.000000 | 0.004200 | 0.000000 | 0.003800 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.001600 | ... | 0.000000 | 0.000000 | 0.001880 | 0.000000 | 0.003150 | 0.000520 | 0.000000 | 3.706000 | 43.000000 | 266.000000 |
| max | 0.045400 | 0.142800 | 0.051000 | 0.428100 | 0.100000 | 0.058800 | 0.072700 | 0.111100 | 0.052600 | 0.181800 | ... | 0.100000 | 0.043850 | 0.097520 | 0.040810 | 0.324780 | 0.060030 | 0.198290 | 1102.500000 | 9989.000000 | 15841.000000 |
8 rows × 57 columns
Descriptive statistics
Emails can be categorized into two groups: spam and non-spam. To better understand these categories, it is important to calculate fundamental statistics for each group. Furthermore, we aim to pinpoint specific characteristics that could significantly influence the classification of an email.
spam = data[data['spam'] == True]
non_spam = data[data['spam'] == False]
spam.describe()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | ... | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 | 1813.000000 |
| mean | 0.001523 | 0.001646 | 0.004038 | 0.001647 | 0.005140 | 0.001749 | 0.002754 | 0.002081 | 0.001701 | 0.003505 | ... | 0.000021 | 0.000206 | 0.001090 | 0.000082 | 0.005137 | 0.001745 | 0.000789 | 9.519165 | 104.393271 | 470.619415 |
| std | 0.003106 | 0.003489 | 0.004807 | 0.022191 | 0.007072 | 0.003219 | 0.005721 | 0.005449 | 0.003548 | 0.006314 | ... | 0.000268 | 0.000916 | 0.002821 | 0.000474 | 0.007442 | 0.003605 | 0.006119 | 49.846186 | 299.284969 | 825.081179 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000940 | 0.000000 | 0.000000 | 2.324000 | 15.000000 | 93.000000 |
| 50% | 0.000000 | 0.000000 | 0.003000 | 0.000000 | 0.002900 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000650 | 0.000000 | 0.003310 | 0.000800 | 0.000000 | 3.621000 | 38.000000 | 194.000000 |
| 75% | 0.001700 | 0.002100 | 0.006400 | 0.000000 | 0.007800 | 0.002400 | 0.003400 | 0.001900 | 0.001900 | 0.005100 | ... | 0.000000 | 0.000000 | 0.001440 | 0.000000 | 0.006450 | 0.002110 | 0.000180 | 5.708000 | 84.000000 | 530.000000 |
| max | 0.045400 | 0.047600 | 0.037000 | 0.428100 | 0.076900 | 0.025400 | 0.072700 | 0.111100 | 0.033300 | 0.075500 | ... | 0.007700 | 0.011170 | 0.097520 | 0.011710 | 0.078430 | 0.060030 | 0.198290 | 1102.500000 | 9989.000000 | 15841.000000 |
8 rows × 57 columns
non_spam.describe()
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | ... | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 | 2788.000000 |
| mean | 0.000735 | 0.002445 | 0.002006 | 0.000009 | 0.001810 | 0.000445 | 0.000094 | 0.000384 | 0.000380 | 0.001672 | ... | 0.000512 | 0.000503 | 0.001586 | 0.000227 | 0.001100 | 0.000116 | 0.000217 | 2.377301 | 18.214491 | 161.470947 |
| std | 0.002978 | 0.016332 | 0.005030 | 0.000213 | 0.006145 | 0.002229 | 0.001105 | 0.002472 | 0.001985 | 0.006432 | ... | 0.003652 | 0.003034 | 0.002606 | 0.001349 | 0.008209 | 0.000696 | 0.002439 | 5.113685 | 39.084792 | 355.738403 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 1.000000 |
| 25% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.384000 | 4.000000 | 18.750000 |
| 50% | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000645 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.857000 | 10.000000 | 54.000000 |
| 75% | 0.000000 | 0.000000 | 0.001200 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.002220 | 0.000000 | 0.000270 | 0.000000 | 0.000000 | 2.555000 | 18.000000 | 141.000000 |
| max | 0.043400 | 0.142800 | 0.051000 | 0.008700 | 0.100000 | 0.058800 | 0.030700 | 0.058800 | 0.052600 | 0.181800 | ... | 0.100000 | 0.043850 | 0.052770 | 0.040810 | 0.324780 | 0.020380 | 0.074070 | 251.000000 | 1488.000000 | 5902.000000 |
8 rows × 57 columns
Histogram distributions
Histograms visually represent the distribution of values within each feature, revealing the patterns and tendencies associated with spam and non-spam emails. By scrutinizing these histograms, one can discern differences between the two distributions and thereby identify the features that distinguish spam from legitimate messages. For instance, when comparing word_freq_business with word_freq_3d, it is clear that the latter is a good feature for discriminating between spam and non-spam.
def plot_histogram(feature, spam, non_spam):
    plt.hist(spam[feature], bins=20, alpha=0.5, label='Spam')
    plt.hist(non_spam[feature], bins=20, alpha=0.5, label='Non-Spam')
    plt.xlabel(feature)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {feature} for Spam and Non-Spam Emails')
    plt.legend()
    plt.show()
plot_histogram('word_freq_business', spam, non_spam)
plot_histogram('word_freq_3d', spam, non_spam)
Word frequencies
Certain columns showcase markedly higher maximum values within one class, in contrast with relatively lower values in the counterpart class. These observations provide valuable insights into potential discriminative features crucial for email classification.
In order to identify influential features impacting email classification, we scrutinize the features by averaging the values of the word frequencies and plotting them.
mean_wf = data.groupby('spam').mean()
mean_wr_fr = mean_wf.iloc[:, 0:-9]
nospam_wr_fr = mean_wr_fr.iloc[0]
spam_wr_fr = mean_wr_fr.iloc[1]
The initial graph presented here juxtaposes the average word frequency values in spam (depicted in orange) and non-spam emails (depicted in blue). Notably, certain words like "3d" (as shown previously) and "you" exhibit higher average frequencies in SPAM emails, while others like "hp," "address," "font," and "george" are more prevalent in non-spam emails. This suggests that the frequency of specific words plays a key role in email classification.
Following a similar approach, we extended the analysis to the frequencies of special characters. It is commonly observed that non-spam emails tend to display a significant presence of such characters.
This comparative analysis provides valuable insights into the distinctive word and character frequency patterns between spam and non-spam emails, contributing to a better understanding of classification dynamics.
plt.figure(figsize=(16, 9))
plt.bar(nospam_wr_fr.index, nospam_wr_fr.values, width=1, alpha=0.8)
plt.bar(spam_wr_fr.index, spam_wr_fr.values, width=1, alpha=0.8)
plt.xticks(rotation='vertical')
plt.legend(['non_spam', 'spam'])
plt.grid()
plt.show()
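The same group-and-average pattern extends to the char_freq_* columns mentioned above. A minimal sketch on toy data (stand-in values, not the real dataset):

```python
import pandas as pd

# Toy stand-in for two char_freq_* columns plus the class label
df = pd.DataFrame({
    "char_freq_!": [0.778, 0.372, 0.000, 0.353],
    "char_freq_(": [0.000, 0.132, 0.232, 0.000],
    "spam": [True, True, False, False],
})

# Mean character frequency per class
mean_cf = df.groupby("spam").mean()
non_spam_cf, spam_cf = mean_cf.loc[False], mean_cf.loc[True]
```

On the real frame the equivalent is `data.groupby('spam').mean()` restricted to the six char_freq_* columns, which can then be plotted with two overlaid plt.bar calls exactly as for the word frequencies.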
Feature ratios
In order to select influential features for email classification, we average the feature values within the Spambase dataset for each class and compute the ratio between the spam and non-spam means. We show only the features whose ratio exceeds the average ratio.
spam_mean = spam.mean()
non_spam_mean = non_spam.mean()
spam_diff = pd.concat(
    [spam_mean, non_spam_mean, spam_mean/non_spam_mean], axis=1)
# remove last row (spam column)
spam_diff = spam_diff[:-1]
spam_diff.columns = ['Spam', 'Non-Spam', 'Ratio']
spam_diff.sort_values(by='Ratio', ascending=False, inplace=True)
spam_diff_mean = spam_diff['Ratio'].mean()
selected_spam_diff = spam_diff[spam_diff['Ratio'] > spam_diff_mean]
selected_spam_diff
| Spam | Non-Spam | Ratio | |
|---|---|---|---|
| word_freq_3d | 0.001647 | 0.000009 | 185.872477 |
| word_freq_000 | 0.002471 | 0.000071 | 34.857704 |
| word_freq_remove | 0.002754 | 0.000094 | 29.351310 |
| word_freq_credit | 0.002055 | 0.000076 | 27.117520 |
| char_freq_$ | 0.001745 | 0.000116 | 14.978608 |
| word_freq_addresses | 0.001121 | 0.000083 | 13.474663 |
| word_freq_money | 0.002129 | 0.000171 | 12.421667 |
We can plot the distribution of the ratios to have a better idea.
spam_diff['Ratio'].plot(kind='bar', figsize=(10, 6))
plt.xlabel('Features')
plt.ylabel('Ratio')
plt.title('Spam vs Non-Spam Ratio Comparison')
plt.show()
spam_indicators = list(selected_spam_diff.index.values)
spam_indicators.append('spam')
spam_indicators
['word_freq_3d', 'word_freq_000', 'word_freq_remove', 'word_freq_credit', 'char_freq_$', 'word_freq_addresses', 'word_freq_money', 'spam']
Upon closer examination of certain word pairs, a discernible trend emerges: the joint appearance of both words in an email often suggests a higher likelihood of it being classified as spam. Furthermore, there is an intriguing correlation with word frequency, where a higher frequency is indicative of a higher likelihood of the email being categorized as spam.
# `%%warnings` is not a real cell magic; suppress warnings locally instead
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    pair_spam = sns.pairplot(data[spam_indicators].iloc[::-1], hue="spam")
    pair_spam.fig.suptitle('SPAM indicators', y=1.01, fontsize=20)
Hypothesis testing (chi-square test) on features
The p-values obtained through the chi-square test serve as crucial indicators in understanding the relationship between the examined feature (independent variable) and the target variable 'spam.' The null hypothesis, in this context, posits no association or difference between the feature and the likelihood of an email being classified as spam.
Interpretation Guidelines:
Small p-value (e.g., < 0.05):
- Conclusion: Reject the null hypothesis.
- Implication: Strong evidence exists, suggesting an association or difference between the feature and the 'spam' variable. The feature is likely to be statistically significant in predicting spam.
Large p-value (e.g., > 0.05):
- Conclusion: Fail to reject the null hypothesis.
- Implication: Insufficient evidence to conclude an association or difference between the feature and the 'spam' variable. The feature may not be statistically significant in predicting spam.
A commonly used significance level (alpha) is 0.05. If a p-value is less than or equal to alpha, the null hypothesis is rejected. Careful consideration of these p-values allows the identification of features that play a significant role in predicting spam.
To calculate the p-values we can engage the Chi-Square Test:
The Chi-Square test is a statistical method used to determine if there is a significant association between two categorical variables. It compares the observed distribution of categorical data with the distribution that would be expected if the variables were independent. The test yields a p-value, indicating the probability of obtaining the observed distribution by chance.
Formula:
The Chi-Square test statistic (χ²) is calculated using the formula:
$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} $$where:
- $O_i$ is the observed frequency in each category,
- $E_i$ is the expected frequency in each category assuming independence.
The test compares the sum of squared differences between observed and expected frequencies, normalized by the expected frequencies. A higher Chi-Square value suggests a greater difference between observed and expected values, and a lower p-value indicates stronger evidence against the null hypothesis of independence.
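The formula can be verified by hand on a small 2×2 contingency table; scipy's `chi2_contingency` (used below) performs the same computation plus the p-value lookup (note that for 2×2 tables it also applies a continuity correction by default). The counts here are made up for illustration.

```python
# Hand computation of the chi-square statistic for a 2x2 table
observed = [[30, 10],   # e.g. feature present: non-spam / spam counts
            [20, 40]]   # feature absent

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under independence: row total * col total / n
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

# chi2 = 50/3 ~ 16.67 here; with df = (2-1)*(2-1) = 1 this far exceeds the
# 0.05 critical value of about 3.84, so independence would be rejected
```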
p_values = {}
for column in data.columns[:-1]:
    contingency_table = pd.crosstab(data[column], data['spam'])
    _, p_value, _, _ = chi2_contingency(contingency_table)
    p_values[column] = round(p_value, 5)
pd.crosstab(data['capital_run_length_total'], data['spam'])
| spam | False | True |
|---|---|---|
| capital_run_length_total | ||
| 1 | 9 | 0 |
| 2 | 8 | 5 |
| 3 | 31 | 1 |
| 4 | 46 | 1 |
| 5 | 114 | 1 |
| ... | ... | ... |
| 9088 | 0 | 1 |
| 9090 | 0 | 1 |
| 9163 | 0 | 1 |
| 10062 | 0 | 1 |
| 15841 | 0 | 1 |
919 rows × 2 columns
sorted_p_values = dict(
    sorted(p_values.items(), key=lambda item: float(item[1]), reverse=True))
keys = sorted_p_values.keys()
values = [float(v) for v in sorted_p_values.values()]
plt.figure(figsize=(10, 5))
plt.bar(keys, values)
plt.xticks(rotation=90)
plt.xlabel('Features')
plt.ylabel('P-values')
plt.title('P-values for each feature')
plt.show()
Features with Higher P-Values:
- word_freq_cs (0.73532)
- word_freq_table (0.28752)
- word_freq_conference (0.25999)
- word_freq_project (0.23742)
- word_freq_parts (0.22839)
- word_freq_data (0.13572)
- char_freq_[ (0.11276)
- word_freq_meeting (0.08921)
These results suggest that the mentioned features may not be as discriminatory in identifying spam emails compared to others. Notably, features selected previously as spam indicators, such as 'word_freq_3d,' 'word_freq_remove,' 'word_freq_addresses,' 'word_freq_credit,' 'word_freq_000,' 'word_freq_money,' and 'char_freq_$,' are conspicuously absent from the list. This absence indicates their potential efficacy as strong indicators of spam emails in the Spambase dataset.
p_value_threshold = 0.05
non_significant_indicators = {
k: v for k, v in sorted_p_values.items() if v > p_value_threshold}
non_significant_indicators
{'word_freq_cs': 0.73532,
'word_freq_table': 0.28752,
'word_freq_conference': 0.25999,
'word_freq_project': 0.23742,
'word_freq_parts': 0.22839,
'word_freq_data': 0.13572,
'char_freq_[': 0.11276,
'word_freq_meeting': 0.08921}
spam_indicators = spam_indicators[:-1]
spam_indicators
['word_freq_3d', 'word_freq_000', 'word_freq_remove', 'word_freq_credit', 'char_freq_$', 'word_freq_addresses', 'word_freq_money']
Outlier Analysis ¶
In our exploration of the Spambase dataset, we aimed to identify outliers and understand their impact on the data. Initially, we plotted a boxplot for all the word frequency features, disregarding the distinction between spam and non-spam emails. This allowed us to observe the overall distribution of the data and identify potential outliers.
data_wr_fr = data.iloc[:, :-10]
data_char_freq = data.iloc[:, -10:-4]
data_capital_run = data.iloc[:, -4:-1]
def draw_boxplot(ax, label, data):
    ax.boxplot(data,
               vert=True,
               patch_artist=True,
               labels=data.columns)
    ax.set_title(label)
    ax.yaxis.grid(True)
    ax.tick_params(labelrotation=90)
fig, ax = plt.subplots(figsize=(20, 5))
draw_boxplot(ax, 'Boxplots Word Frequencies', data_wr_fr)
plt.show()
Word frequencies ¶
To gain a deeper understanding of outliers within each class, we opted to create separate boxplots for spam and non-spam emails. This more nuanced approach allows us to discern specific characteristics within each category and better comprehend the distinctions in outliers between spam and non-spam instances.
Symmetry Differences:
- The symmetry of the same feature differs between spam and non-spam classes. For instance, consider `word_freq_george`, a feature used to label non-spam emails. The asymmetry suggests that this feature may not exhibit similar behavior across both classes.

Distinctive Spam Features:
- Certain features, such as `word_freq_3d` and `word_freq_credit`, clearly stand out as potential indicators for classifying spam emails. These features demonstrate notable differences between spam and non-spam distributions, as previously stated.

Potential Non-Spam Indicators:
- In examining boxplots for non-spam emails, features like `word_freq_hp`, `word_freq_lab`, and `word_freq_meeting` emerge as potential indicators. Notably, non-spam distributions seem to harbor more outliers, suggesting potential discriminative power in these features.

Common Minimal Value:
- Each boxplot has a minimum value of 0, reflecting the inherent nature of the features, which are frequencies and therefore never negative.
These insights derived from the boxplots contribute to a nuanced understanding of feature behaviors within spam and non-spam categories, aiding in the identification of key indicators for effective email classification.
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(20, 9))
draw_boxplot(ax1, 'Boxplots Word Frequencies for SPAM emails', data_wr_fr[data['spam']==True])
draw_boxplot(ax2, 'Boxplots Word Frequencies for NON-SPAM emails', data_wr_fr[data['spam']==False])
fig.subplots_adjust(hspace=0.8)
plt.show()
Character frequencies ¶
Observing the outlier distributions concerning the frequency of certain characters, notable distinctions emerge:
- `char_freq_$`: A noticeable difference in distribution is observed between spam and non-spam emails. This discrepancy aligns with a common practice in spam emails, where fraudulent messages often mention free offerings (0$) or repeat symbols like $$$.
- `char_freq_;`: This feature correlates with non-spam emails beyond a certain frequency threshold. The correlation is plausible, as the presence of `;` is often associated with well-organized text (unlike this report :/).
General Observation on Outliers:¶
In summary, outliers in our dataset carry informative value and contribute to the classification process. Notably, features like char_freq_$ and char_freq_; showcase distinctive patterns between spam and non-spam emails. Consequently, the decision has been made not to remove any outliers from our data.
char_freq_cols = data_char_freq.columns
fig, axes = plt.subplots(1, len(char_freq_cols), figsize=(18, 6))
for i, col in enumerate(char_freq_cols):
    data.boxplot(by='spam', column=col, ax=axes[i])
    axes[i].set_title(col)
fig.suptitle('Comparison of Character Frequencies')
plt.tight_layout()
plt.show()
Capital Run frequencies ¶
We also investigated the role of capital letters in distinguishing between spam and non-spam emails. Notably, the feature capital_run_length_average caught our interest, as it represents the average length of consecutive sequences of capital letters in an email. This metric proved to be a valuable indicator, showcasing a higher average presence of consecutive capital letters in spam emails compared to non-spam counterparts.
Upon visualizing the data, we observed that spam emails exhibit a tendency towards longer consecutive sequences of capital letters, suggesting a potential pattern that could aid in classification.
# Use a separate name so the data_capital_run DataFrame is not overwritten
capital_run_cols = data_capital_run.columns
fig, axes = plt.subplots(1, len(capital_run_cols), figsize=(18, 6))
for i, col in enumerate(capital_run_cols):
    data.boxplot(by='spam', column=col, ax=axes[i])
    axes[i].set_title(col)
fig.suptitle('Comparison of Capital Run Frequencies')
plt.tight_layout()
plt.show()
Interquartile Range (IQR) Analysis ¶
The Interquartile Range (IQR) serves as a crucial measure of statistical dispersion, representing the range between the first quartile (Q1) and the third quartile (Q3) within a dataset. It provides insights into the spread of the middle 50% of the data.
Interpreting the Values:¶
- Each column in `iqr_df` corresponds to a feature from the dataset.
- A larger IQR suggests greater variability in the middle 50% of the data for a specific feature.
- A small IQR indicates that the central portion of the data is concentrated in a narrow range.
- By comparing IQR values between "spam" and "non_spam," we can identify features where the spread of data significantly differs for the two categories.
Example Interpretation:¶
For instance, if iqr_df indicates that the IQR for feature X is substantially larger in the "spam" category compared to the "non_spam" category, it suggests that the spread of values for feature X is more diverse among spam instances.
This IQR analysis provides valuable insights into the distributional differences within numerical features, aiding in the identification of characteristics that may contribute to the classification of spam and non-spam instances.
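As a toy illustration on made-up numbers (pandas' `quantile`, used in the next cell, applies linear interpolation between data points by default):

```python
import pandas as pd

# Made-up sample of 10 values
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
q1, q3 = s.quantile([0.25, 0.75])  # first and third quartiles
iqr = q3 - q1
# q1 = 3.25, q3 = 7.75, so the IQR is 4.5: the middle 50% of the
# sample spans 4.5 units
```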
columns = data.iloc[:, :-1].columns
spam_iqr = []
non_spam_iqr = []
for col in columns:
    spam_q1, spam_q3 = data[data['spam'] == True][col].quantile([1/4, 3/4])
    non_spam_q1, non_spam_q3 = data[data['spam'] == False][col].quantile([1/4, 3/4])
    spam_iqr.append(spam_q3 - spam_q1)
    non_spam_iqr.append(non_spam_q3 - non_spam_q1)
iqr_df = pd.DataFrame([spam_iqr, non_spam_iqr], columns=columns, index=["spam", "non_spam"])
iqr_df
| word_freq_make | word_freq_address | word_freq_all | word_freq_3d | word_freq_our | word_freq_over | word_freq_remove | word_freq_internet | word_freq_order | word_freq_mail | ... | word_freq_conference | char_freq_; | char_freq_( | char_freq_[ | char_freq_! | char_freq_$ | char_freq_# | capital_run_length_average | capital_run_length_longest | capital_run_length_total | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| spam | 0.0017 | 0.0021 | 0.0064 | 0.0 | 0.0078 | 0.0024 | 0.0034 | 0.0019 | 0.0019 | 0.0051 | ... | 0.0 | 0.0 | 0.00144 | 0.0 | 0.00551 | 0.00211 | 0.00018 | 3.384 | 69.0 | 437.00 |
| non_spam | 0.0000 | 0.0000 | 0.0012 | 0.0 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | ... | 0.0 | 0.0 | 0.00222 | 0.0 | 0.00027 | 0.00000 | 0.00000 | 1.171 | 14.0 | 122.25 |
2 rows × 57 columns
The observed values in our analysis further confirm the earlier assertions regarding features that appear to be particularly informative in discerning between spam and non-spam emails. Notably, features such as capital_run_length_total, word_freq_free, and others exhibit noticeable distinctions in their Interquartile Range (IQR) values when stratified by the spam and non-spam classes.
The IQR, a measure of statistical dispersion, provides insight into the data spread within each class. The discernible differences in IQR values between spam and non-spam instances for specific features suggest that these variables carry significant discriminatory potential. For instance, capital_run_length_total indicates a variance in the total length of consecutive capital letters, while word_freq_free reflects the frequency of the word "free" in the email.
These findings reinforce the hypothesis that certain features possess inherent patterns or characteristics that contribute significantly to classifying emails into spam or non-spam categories.
iqr_df.loc['abs_diff'] = abs(iqr_df.loc["non_spam"] - iqr_df.loc["spam"])
iqr_df_transposed = iqr_df.T
iqr_df_transposed[iqr_df_transposed['abs_diff']>0].sort_values(by='abs_diff', ascending=False)
| spam | non_spam | abs_diff | |
|---|---|---|---|
| capital_run_length_total | 437.00000 | 122.250000 | 314.750000 |
| capital_run_length_longest | 69.00000 | 14.000000 | 55.000000 |
| capital_run_length_average | 3.38400 | 1.171000 | 2.213000 |
| word_freq_your | 0.01490 | 0.004600 | 0.010300 |
| word_freq_hp | 0.00000 | 0.010000 | 0.010000 |
| word_freq_our | 0.00780 | 0.000000 | 0.007800 |
| word_freq_free | 0.00640 | 0.000000 | 0.006400 |
| char_freq_! | 0.00551 | 0.000270 | 0.005240 |
| word_freq_all | 0.00640 | 0.001200 | 0.005200 |
| word_freq_mail | 0.00510 | 0.000000 | 0.005100 |
| word_freq_email | 0.00390 | 0.000000 | 0.003900 |
| word_freq_remove | 0.00340 | 0.000000 | 0.003400 |
| word_freq_business | 0.00340 | 0.000000 | 0.003400 |
| word_freq_000 | 0.00340 | 0.000000 | 0.003400 |
| word_freq_hpl | 0.00000 | 0.003300 | 0.003300 |
| word_freq_money | 0.00290 | 0.000000 | 0.002900 |
| word_freq_re | 0.00050 | 0.003125 | 0.002625 |
| word_freq_over | 0.00240 | 0.000000 | 0.002400 |
| char_freq_$ | 0.00211 | 0.000000 | 0.002110 |
| word_freq_address | 0.00210 | 0.000000 | 0.002100 |
| word_freq_order | 0.00190 | 0.000000 | 0.001900 |
| word_freq_internet | 0.00190 | 0.000000 | 0.001900 |
| word_freq_make | 0.00170 | 0.000000 | 0.001700 |
| word_freq_people | 0.00170 | 0.000000 | 0.001700 |
| word_freq_george | 0.00000 | 0.001625 | 0.001625 |
| word_freq_receive | 0.00140 | 0.000000 | 0.001400 |
| word_freq_1999 | 0.00000 | 0.001000 | 0.001000 |
| word_freq_will | 0.00840 | 0.007525 | 0.000875 |
| char_freq_( | 0.00144 | 0.002220 | 0.000780 |
| word_freq_you | 0.02050 | 0.019925 | 0.000575 |
| char_freq_# | 0.00018 | 0.000000 | 0.000180 |
Multicollinearity ¶
Before moving on to the classification methods, we should address a potential issue in our dataset: multicollinearity. Multicollinearity occurs when independent variables in a multiple regression model display high correlations among themselves. Correlation between independent variables (our features) makes it hard to distinguish the individual effect of each feature on the dependent variable (the class of the email: spam or non-spam). In such cases, we can remove the correlated variables (identified via the correlation matrix, e.g. by plotting a heatmap), apply a feature selection method, or perform Principal Component Analysis (PCA).
Analyzing Correlations in the Spambase Dataset¶
In this analysis of the Spambase dataset, we aim to uncover potential relationships and dependencies between different parameters by utilizing a correlation matrix. A correlation matrix is a tabular representation of correlation coefficients between variables in a dataset. The correlation coefficient quantifies the strength and direction of a linear relationship between two variables. In this specific investigation, we choose to employ the Kendall correlation coefficient as opposed to Pearson, considering the presence of outliers in the dataset. The Kendall correlation coefficient is particularly robust in scenarios with outliers and non-normally distributed data. It measures the strength of dependence between two variables by comparing the number of concordant and discordant pairs of observations. The formula for calculating the Kendall correlation coefficient, denoted as $\tau$, is as follows:
$$\tau = \frac{\text{Number of concordant pairs} - \text{Number of discordant pairs}}{\text{Total number of pairs}}$$

Here, concordant pairs are those with the same order of ranks in both variables, while discordant pairs have different orderings. By employing the Kendall correlation coefficient, we aim to gain insights into potential associations among the parameters in the Spambase dataset while accounting for its unique characteristics, including the presence of outliers.
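A quick sanity check of this formula on made-up ranks, using scipy's `kendalltau`:

```python
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 5, 4]  # identical ordering except for one swapped pair
tau, p_value = kendalltau(x, y)
# 10 pairs in total: 9 concordant, 1 discordant
# -> tau = (9 - 1) / 10 = 0.8
```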
import seaborn as sns
new_df = data.iloc[:, :-1].copy()
plt.rcParams.update({'figure.figsize':(60,55), 'figure.dpi':100})
correlation_matrix = new_df.corr(method='kendall')
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", vmin=-1, vmax=1, cbar=True, cmap='coolwarm', annot_kws={'size': 15})
plt.show()
threshold = 0.7
high_corr = correlation_matrix[abs(correlation_matrix) > threshold]
np.fill_diagonal(high_corr.values, np.nan)
mask = np.triu(np.ones_like(high_corr, dtype=bool))
inverse_mask = ~mask
high_corr_masked = high_corr * inverse_mask
high_corr_masked.dropna(how='all', axis=1, inplace=True)
high_corr_masked.dropna(how='all', axis=0, inplace=True)
mask = np.triu(np.ones_like(high_corr_masked, dtype=bool))
plt.rcParams.update({'figure.figsize':(15,11), 'figure.dpi':100})
sns.heatmap(high_corr_masked, mask=mask, annot=True, fmt=".2f", vmin=-1, vmax=1, cbar=True, cmap='coolwarm', annot_kws={'size': 15})
plt.show()
We can decide to remove:
- `word_freq_hpl`, since it is correlated with `word_freq_hp`
- `word_freq_telnet`, since it is correlated with `word_freq_857` and `word_freq_415`
- `word_freq_857`, since it is correlated with `word_freq_415`
- `word_freq_85`, since it is correlated with `word_freq_650`
- `capital_run_length_longest`, since it is correlated with `capital_run_length_average`
high_corr_attributes = ['word_freq_hpl', 'word_freq_telnet', 'word_freq_857', 'word_freq_85', 'capital_run_length_longest']
Classification Algorithms ¶
Logistic Regression ¶
In this analysis, we utilize the logistic regression model from the statsmodels library to classify the emails.
To assess the performance of our logistic regression model, we consider several key statistical metrics:
- Pseudo R-squared: an analogue of the R-squared used in linear regression. It indicates the explanatory power of the model, i.e. how well the logistic model performs compared to a baseline model that predicts the outcome using no features. It is a tool for model comparison rather than an absolute measure of fit; its value lies between 0 and 1, with values closer to 1 indicating stronger explanatory power.
- LLR (Log-Likelihood Ratio) p-value: tests the null hypothesis that all coefficients are zero (i.e., the model is no better than an intercept-only model). A small LLR p-value suggests that our model is statistically significant in distinguishing between spam and non-spam emails.
- P>|z| for the parameters: the probability of observing a z-statistic as extreme as the one computed, under the null hypothesis that a particular coefficient is zero. Smaller values suggest that the corresponding feature plays a significant role in predicting whether an email is spam.
By analyzing these parameters, we can understand not only the performance of our model but also the importance of different features in the spam detection task.
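For reference, the "Pseudo R-squ." that statsmodels reports for a `Logit` fit is McFadden's version, $1 - LL_{model}/LL_{null}$. Plugging in the log-likelihood values reported by the fitted model later in this section:

```python
# McFadden's pseudo R-squared: 1 - LL_model / LL_null
ll_model = -645.69   # Log-Likelihood from the statsmodels summary
ll_null = -2310.5    # LL-Null (intercept-only model)
pseudo_r2 = 1 - ll_model / ll_null
print(round(pseudo_r2, 4))  # 0.7205, matching the summary table
```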
names_list_filepath = 'spambase/names.txt'
attribute_names = []
with open(names_list_filepath, 'r') as file:
    attribute_names = file.read().splitlines()
data = pd.read_csv('spambase/spambase.data', names=attribute_names)
data.iloc[:, :-4] /= 100
column_name_mapping = {'char_freq_;':'char_freq_semicolon',
'char_freq_(':'char_freq_round_bracket',
'char_freq_[':'char_freq_square_bracket',
'char_freq_!':'char_freq_exclamation',
'char_freq_#':'char_freq_hash',
'char_freq_$':'char_freq_dollar',
'Class':'spam'}
data.rename(columns=column_name_mapping, inplace=True)
data_attributes = data.columns.tolist()[:-1]
Multicollinearity in Logistic Regression ¶
In our analysis of the Spambase dataset using logistic regression, we encounter an issue of multicollinearity (as shown previously). When multicollinearity is present, it becomes difficult to isolate the individual effect of each predictor on the response variable. High multicollinearity among predictors leads to inflated standard errors in the regression coefficients, which in turn can lead to a wider confidence interval and less reliable probability values (P > |t|) for the hypothesis tests.
Evidence of Multicollinearity - Singular Matrix Error:
In our case, applying logistic regression to the Spambase dataset using all features resulted in a `LinAlgError: Singular matrix`. This error is indicative of perfect or near-perfect collinearity among some of the variables: the matrix of predictors cannot be inverted, which is required for the regression fit. This scenario often arises when the data includes redundant variables (predictors that are linear combinations of other predictors).
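The error can be reproduced on a minimal made-up example: a matrix with linearly dependent columns has no inverse, and NumPy raises the same `LinAlgError`:

```python
import numpy as np

# Second column is exactly twice the first, so the matrix is singular
A = np.array([[1.0, 2.0],
              [2.0, 4.0]])
try:
    np.linalg.inv(A)
    singular = False
except np.linalg.LinAlgError as err:
    singular = True
    print(err)  # Singular matrix
```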
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)
formula = "spam ~ " + " + ".join(data_attributes)
model = logit(formula, email_train).fit()
summary = model.summary()
summary
Warning: Maximum number of iterations has been exceeded.
Current function value: inf
Iterations: 35
---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell 75, line 5
      1 email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)
      3 formula = "spam ~ " + " + ".join(data_attributes)
----> 5 model = logit(formula, email_train).fit()

[... statsmodels and numpy internals elided ...]

File ~/anaconda3/envs/fad/lib/python3.9/site-packages/numpy/linalg/linalg.py:112, in _raise_linalgerror_singular(err, flag)
    111 def _raise_linalgerror_singular(err, flag):
--> 112     raise LinAlgError("Singular matrix")

LinAlgError: Singular matrix
Addressing Multicollinearity:¶
To resolve this issue, we can try to remove highly correlated features.
Variables Removed to Reduce Multicollinearity:
We identified and removed the following variables due to their high correlation with other features in the dataset (they are the same features we found before):
- `word_freq_hpl`
- `word_freq_telnet`
- `word_freq_857`
- `word_freq_85`
- `capital_run_length_longest`
Impact on the Model:
After removing these features, the total number of features in our model was reduced to 52. This adjustment yielded a Pseudo R-squared of 0.7205, indicating relatively strong explanatory power with the reduced set of predictors.

Convergence Issue:
Despite these adjustments, an important issue arose: the model did not converge after 35 iterations.

Next Steps:
- Simplifying the model by reducing the number of predictors.
- Adjusting the fitting algorithm, such as increasing the number of iterations or changing the convergence criteria.
high_corr_attributes
data_attributes_no_corr = [attr for attr in data_attributes if attr not in high_corr_attributes]
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)
formula = "spam ~ " + " + ".join(data_attributes_no_corr)
model = logit(formula, email_train).fit()
summary = model.summary()
summary
Warning: Maximum number of iterations has been exceeded.
Current function value: 0.187157
Iterations: 35
| Dep. Variable: | spam | No. Observations: | 3450 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3397 |
| Method: | MLE | Df Model: | 52 |
| Date: | Tue, 19 Dec 2023 | Pseudo R-squ.: | 0.7205 |
| Time: | 17:26:37 | Log-Likelihood: | -645.69 |
| converged: | False | LL-Null: | -2310.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -1.8215 | 0.167 | -10.923 | 0.000 | -2.148 | -1.495 |
| word_freq_make | -45.4364 | 27.856 | -1.631 | 0.103 | -100.033 | 9.160 |
| word_freq_address | -11.2960 | 8.086 | -1.397 | 0.162 | -27.143 | 4.551 |
| word_freq_all | 22.2041 | 14.407 | 1.541 | 0.123 | -6.034 | 50.442 |
| word_freq_3d | 188.9055 | 141.728 | 1.333 | 0.183 | -88.877 | 466.688 |
| word_freq_our | 74.6857 | 14.184 | 5.265 | 0.000 | 46.885 | 102.486 |
| word_freq_over | 87.0471 | 29.119 | 2.989 | 0.003 | 29.975 | 144.119 |
| word_freq_remove | 236.9795 | 39.456 | 6.006 | 0.000 | 159.647 | 314.312 |
| word_freq_internet | 45.5503 | 15.609 | 2.918 | 0.004 | 14.958 | 76.142 |
| word_freq_order | 76.2801 | 38.573 | 1.978 | 0.048 | 0.679 | 151.881 |
| word_freq_mail | 2.2032 | 7.645 | 0.288 | 0.773 | -12.782 | 17.188 |
| word_freq_receive | 12.3526 | 37.611 | 0.328 | 0.743 | -61.364 | 86.069 |
| word_freq_will | -12.7096 | 8.412 | -1.511 | 0.131 | -29.196 | 3.777 |
| word_freq_people | 7.2314 | 29.409 | 0.246 | 0.806 | -50.409 | 64.871 |
| word_freq_report | 20.8883 | 16.436 | 1.271 | 0.204 | -11.325 | 53.102 |
| word_freq_addresses | 86.8922 | 71.429 | 1.216 | 0.224 | -53.107 | 226.891 |
| word_freq_free | 108.4484 | 17.778 | 6.100 | 0.000 | 73.605 | 143.292 |
| word_freq_business | 93.3197 | 25.837 | 3.612 | 0.000 | 42.681 | 143.959 |
| word_freq_email | 0.9827 | 13.957 | 0.070 | 0.944 | -26.372 | 28.337 |
| word_freq_you | 11.2640 | 4.127 | 2.729 | 0.006 | 3.175 | 19.353 |
| word_freq_credit | 159.7084 | 89.855 | 1.777 | 0.076 | -16.404 | 335.821 |
| word_freq_your | 29.1681 | 6.604 | 4.417 | 0.000 | 16.225 | 42.111 |
| word_freq_font | 15.2619 | 19.085 | 0.800 | 0.424 | -22.143 | 52.667 |
| word_freq_000 | 217.2654 | 52.468 | 4.141 | 0.000 | 114.430 | 320.101 |
| word_freq_money | 37.2177 | 15.865 | 2.346 | 0.019 | 6.123 | 68.312 |
| word_freq_hp | -268.1836 | 37.433 | -7.164 | 0.000 | -341.551 | -194.816 |
| word_freq_george | -1946.4215 | 319.258 | -6.097 | 0.000 | -2572.155 | -1320.688 |
| word_freq_650 | 53.0517 | 28.887 | 1.836 | 0.066 | -3.567 | 109.670 |
| word_freq_lab | -231.6616 | 155.425 | -1.491 | 0.136 | -536.288 | 72.965 |
| word_freq_labs | -58.6854 | 47.956 | -1.224 | 0.221 | -152.677 | 35.306 |
| word_freq_data | -125.9851 | 45.157 | -2.790 | 0.005 | -214.491 | -37.479 |
| word_freq_415 | -1160.4886 | 408.508 | -2.841 | 0.005 | -1961.150 | -359.827 |
| word_freq_technology | 120.5670 | 36.498 | 3.303 | 0.001 | 49.032 | 192.102 |
| word_freq_1999 | 10.9384 | 26.862 | 0.407 | 0.684 | -41.711 | 63.588 |
| word_freq_parts | 147.1036 | 120.475 | 1.221 | 0.222 | -89.023 | 383.230 |
| word_freq_pm | -98.5601 | 47.632 | -2.069 | 0.039 | -191.918 | -5.202 |
| word_freq_direct | -40.9160 | 40.089 | -1.021 | 0.307 | -119.488 | 37.656 |
| word_freq_cs | -5359.3845 | 5581.531 | -0.960 | 0.337 | -1.63e+04 | 5580.216 |
| word_freq_meeting | -308.1263 | 110.470 | -2.789 | 0.005 | -524.644 | -91.608 |
| word_freq_original | -248.3528 | 133.466 | -1.861 | 0.063 | -509.942 | 13.237 |
| word_freq_project | -143.7917 | 61.447 | -2.340 | 0.019 | -264.225 | -23.358 |
| word_freq_re | -80.0238 | 16.100 | -4.970 | 0.000 | -111.580 | -48.468 |
| word_freq_edu | -185.0862 | 36.148 | -5.120 | 0.000 | -255.934 | -114.238 |
| word_freq_table | -304.1867 | 260.651 | -1.167 | 0.243 | -815.053 | 206.680 |
| word_freq_conference | -478.9719 | 200.538 | -2.388 | 0.017 | -872.019 | -85.925 |
| char_freq_semicolon | -142.5491 | 54.074 | -2.636 | 0.008 | -248.532 | -36.566 |
| char_freq_round_bracket | -18.7805 | 31.375 | -0.599 | 0.549 | -80.275 | 42.714 |
| char_freq_square_bracket | -107.0447 | 141.624 | -0.756 | 0.450 | -384.623 | 170.534 |
| char_freq_exclamation | 22.0186 | 6.082 | 3.620 | 0.000 | 10.098 | 33.939 |
| char_freq_dollar | 559.6474 | 81.671 | 6.852 | 0.000 | 399.575 | 719.719 |
| char_freq_hash | 303.2639 | 123.727 | 2.451 | 0.014 | 60.763 | 545.765 |
| capital_run_length_average | 0.1026 | 0.020 | 5.011 | 0.000 | 0.062 | 0.143 |
| capital_run_length_total | 0.0015 | 0.000 | 6.146 | 0.000 | 0.001 | 0.002 |
Possibly complete quasi-separation: A fraction 0.29 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
test_probs = model.predict(email_test.dropna())
test_preds = test_probs.round().astype(int)
test_gt = email_test.dropna()['spam']
from sklearn.metrics import classification_report
print("Classification Report")
print(classification_report(test_gt, test_preds))
Classification Report
precision recall f1-score support
0 0.91 0.93 0.92 691
1 0.90 0.86 0.88 460
accuracy 0.90 1151
macro avg 0.90 0.90 0.90 1151
weighted avg 0.90 0.90 0.90 1151
Reducing Predictors ¶
To enhance our logistic regression model, we implemented a strategy of simplifying the model by reducing the number of predictors. This was guided by the statistical significance of each predictor, as indicated by their P>|z| values.
Criterion for Predictor Removal:
We chose to remove all features with P>|z| values greater than 0.05. This decision is based on the principle that features with a P>|z| value above this threshold are not statistically significant at the 5% level, implying that their contribution to the model in distinguishing spam from non-spam emails might be negligible.

Variables Removed:
Based on the above criterion, the following variables were removed from the model:
- `word_freq_email`
- `word_freq_people`
- `word_freq_mail`
- ...

Outcome of the Model Refinement:
The removal of these variables resulted in a lighter model with only 28 parameters. Notably, the Pseudo R-squared value of this refined model is 0.7027, slightly lower than the previous value but still indicating substantial explanatory power.

Model Convergence:
A significant improvement with this simplified model is its convergence. Unlike the previous versions, this model successfully converged, in just 19 iterations. This faster convergence suggests that the reduced model is more stable and efficient in fitting the data.

Interpretation:
The slightly lower Pseudo R-squared value suggests a marginal reduction in the model's explanatory power. However, this trade-off is offset by the benefits of a more parsimonious model, which typically offers better generalizability and interpretability. With fewer predictors, each remaining variable in the model is likely to be more meaningful and significant in distinguishing between spam and non-spam emails.
p_values = model.pvalues
if 'Intercept' in p_values.index:
    p_values.drop('Intercept', inplace=True)
ordered_p_values = p_values.sort_values(ascending=False).round(3)
useful_p_values = ordered_p_values[ordered_p_values < 0.05]
useful_attributes = useful_p_values.index.tolist()
useful_attributes
['word_freq_order', 'word_freq_pm', 'word_freq_project', 'word_freq_money', 'word_freq_conference', 'char_freq_hash', 'char_freq_semicolon', 'word_freq_you', 'word_freq_meeting', 'word_freq_data', 'word_freq_415', 'word_freq_internet', 'word_freq_over', 'word_freq_technology', 'word_freq_business', 'char_freq_exclamation', 'word_freq_000', 'word_freq_your', 'word_freq_re', 'capital_run_length_average', 'word_freq_edu', 'word_freq_our', 'word_freq_remove', 'word_freq_george', 'word_freq_free', 'capital_run_length_total', 'char_freq_dollar', 'word_freq_hp']
email_train, email_test = train_test_split(data, test_size=0.25, random_state=0)
formula = "spam ~ " + " + ".join(useful_attributes)
model = logit(formula, email_train).fit()
summary = model.summary()
summary
Optimization terminated successfully.
Current function value: 0.199105
Iterations 19
| Dep. Variable: | spam | No. Observations: | 3450 |
|---|---|---|---|
| Model: | Logit | Df Residuals: | 3421 |
| Method: | MLE | Df Model: | 28 |
| Date: | Tue, 19 Dec 2023 | Pseudo R-squ.: | 0.7027 |
| Time: | 14:56:17 | Log-Likelihood: | -686.91 |
| converged: | True | LL-Null: | -2310.5 |
| Covariance Type: | nonrobust | LLR p-value: | 0.000 |
| coef | std err | z | P>|z| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| Intercept | -1.9379 | 0.143 | -13.579 | 0.000 | -2.218 | -1.658 |
| word_freq_order | 82.0203 | 36.764 | 2.231 | 0.026 | 9.963 | 154.077 |
| word_freq_pm | -105.1509 | 45.252 | -2.324 | 0.020 | -193.843 | -16.459 |
| word_freq_project | -151.8314 | 62.519 | -2.429 | 0.015 | -274.366 | -29.297 |
| word_freq_money | 46.2219 | 18.188 | 2.541 | 0.011 | 10.573 | 81.870 |
| word_freq_conference | -650.7634 | 226.439 | -2.874 | 0.004 | -1094.576 | -206.950 |
| char_freq_hash | 375.1398 | 97.004 | 3.867 | 0.000 | 185.016 | 565.263 |
| char_freq_semicolon | -114.3148 | 35.136 | -3.254 | 0.001 | -183.179 | -45.450 |
| word_freq_you | 11.5959 | 3.942 | 2.942 | 0.003 | 3.870 | 19.322 |
| word_freq_meeting | -308.4246 | 114.030 | -2.705 | 0.007 | -531.919 | -84.931 |
| word_freq_data | -128.9851 | 42.758 | -3.017 | 0.003 | -212.790 | -45.180 |
| word_freq_415 | -1290.3735 | 410.433 | -3.144 | 0.002 | -2094.807 | -485.940 |
| word_freq_internet | 48.5591 | 14.959 | 3.246 | 0.001 | 19.239 | 77.879 |
| word_freq_over | 85.6357 | 28.570 | 2.997 | 0.003 | 29.640 | 141.631 |
| word_freq_technology | 128.9081 | 35.203 | 3.662 | 0.000 | 59.911 | 197.905 |
| word_freq_business | 101.1307 | 25.282 | 4.000 | 0.000 | 51.578 | 150.683 |
| char_freq_exclamation | 24.5126 | 6.684 | 3.667 | 0.000 | 11.411 | 37.614 |
| word_freq_000 | 214.6662 | 51.363 | 4.179 | 0.000 | 113.996 | 315.337 |
| word_freq_your | 24.3360 | 6.019 | 4.043 | 0.000 | 12.539 | 36.133 |
| word_freq_re | -80.7811 | 15.837 | -5.101 | 0.000 | -111.822 | -49.740 |
| capital_run_length_average | 0.1130 | 0.021 | 5.502 | 0.000 | 0.073 | 0.153 |
| word_freq_edu | -206.3152 | 37.021 | -5.573 | 0.000 | -278.875 | -133.755 |
| word_freq_our | 79.6953 | 14.045 | 5.674 | 0.000 | 52.168 | 107.222 |
| word_freq_remove | 247.4560 | 39.337 | 6.291 | 0.000 | 170.356 | 324.556 |
| word_freq_george | -2096.4694 | 330.758 | -6.338 | 0.000 | -2744.743 | -1448.196 |
| word_freq_free | 114.3631 | 17.211 | 6.645 | 0.000 | 80.630 | 148.096 |
| capital_run_length_total | 0.0015 | 0.000 | 6.823 | 0.000 | 0.001 | 0.002 |
| char_freq_dollar | 615.1742 | 84.351 | 7.293 | 0.000 | 449.849 | 780.499 |
| word_freq_hp | -275.2287 | 35.481 | -7.757 | 0.000 | -344.771 | -205.686 |
Possibly complete quasi-separation: A fraction 0.27 of observations can be
perfectly predicted. This might indicate that there is complete
quasi-separation. In this case some parameters will not be identified.
Logistic regression results¶
The classification report provides insights into the performance of our logistic regression model:
Precision: We achieve a high precision score of 0.90 for both classes (0 and 1), indicating a low false positive rate. This means that 90% of the instances predicted as spam or not spam are correct.
Recall: For class 0, the recall score is 0.94, indicating that the model correctly identifies 94% of all actual non-spam instances. However, for class 1, the recall is slightly lower at 0.85, suggesting that 15% of actual spam instances were not captured by the model.
F1-Score: The F1-score, which is the harmonic mean of precision and recall, is 0.92 for non-spam and 0.88 for spam. This confirms the balanced classification capability of the model.
Accuracy: Overall, the model achieves an accuracy of 0.90, which remains consistent across the macro average and weighted average. This underscores the model's robustness in correctly classifying emails as either spam or not spam.
from sklearn import metrics
from sklearn.metrics import classification_report

# Predict spam probabilities on the test set and threshold at 0.5
test_probs = model.predict(email_test.dropna())
test_preds = test_probs.round().astype(int)
test_gt = email_test.dropna()['spam']
plt.rcParams.update({'figure.figsize': (8, 8), 'figure.dpi': 100})
conf_matrix = metrics.confusion_matrix(test_gt, test_preds)
cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['non-spam', 'spam'])
cm_display.plot()
plt.show()
print("Classification Report")
print(classification_report(test_gt, test_preds))
Classification Report
precision recall f1-score support
0 0.90 0.94 0.92 691
1 0.90 0.85 0.88 460
accuracy 0.90 1151
macro avg 0.90 0.89 0.90 1151
weighted avg 0.90 0.90 0.90 1151
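The per-class figures above follow directly from the confusion matrix; as a sanity check, the definitions can be applied by hand. A minimal sketch with made-up counts (not the exact matrix from this run):

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class
#                 pred 0  pred 1
cm = np.array([[650,     41],    # true non-spam
               [ 69,    391]])   # true spam

tn, fp, fn, tp = cm.ravel()

precision_spam = tp / (tp + fp)  # of emails flagged spam, how many really were
recall_spam = tp / (tp + fn)     # of actual spam, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)
accuracy = (tp + tn) / cm.sum()

print(round(precision_spam, 2), round(recall_spam, 2),
      round(f1_spam, 2), round(accuracy, 2))
```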
Support Vector Machine (SVM) ¶
SVM Training with Grid Search ¶
We conducted a comprehensive parameter optimization using Grid Search, varying the kernel parameter over three types: linear, poly (polynomial), and rbf (radial basis function).
Kernel Types and Their Differences¶
Each kernel type represents a different approach to transforming the input data into a higher-dimensional space:
- Linear Kernel: simple and effective for linearly separable data, where a hyperplane can separate the classes.
- Polynomial Kernel (poly): suitable for non-linearly separable data, allowing the model to capture more complex relationships by raising the data to a specified power.
- Radial Basis Function (rbf): highly effective for non-linear data, as it can handle complex, multidimensional relationships by measuring the distance from a central point.
Best Model: Linear Kernel¶
Remarkably, the linear kernel emerged as the best model for the Spambase dataset, indicating that despite potential complexities and non-linearities, the data is predominantly linearly separable. This suggests that a simpler linear decision boundary was sufficient and more effective for this specific dataset, avoiding the overfitting or unnecessary complexity that might arise with higher-order kernels.
Cross-Validation Accuracy and Comparison with Logistic Regression¶
The accuracy obtained from cross-validation with the SVM model was 0.75, which is lower than what was achieved using logistic regression. This disparity could be attributed to several factors. Logistic regression can provide a more flexible fit to the data than SVM, which instead seeks to maximize the margin between classes. Additionally, the effectiveness of logistic regression in this context may also be due to its simplicity and robustness, particularly on datasets that, while potentially complex, still exhibit a strong linear component in their feature relationships.
data_useful_attributes = data[useful_attributes + ['spam']].copy()
X_data = data_useful_attributes.drop(columns=['spam']) # Features
y_data = data_useful_attributes['spam'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# Define the parameter grid
param_grid = {'kernel': ['linear', 'poly', 'rbf']}
svm = SVC()
grid_search = GridSearchCV(svm, param_grid, cv=5, scoring='accuracy')
# Perform grid search
grid_search.fit(X_train, y_train)
print("Best Parameters:", grid_search.best_params_)
print("Best Score:", grid_search.best_score_)
best_svm = grid_search.best_estimator_
Best Parameters: {'kernel': 'linear'}
Best Score: 0.7460869565217391
# Evaluate the best model on the test set
y_pred = best_svm.predict(X_test)
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_pred))
Classification Report (Test Set):
precision recall f1-score support
0 0.72 0.94 0.82 691
1 0.83 0.46 0.59 460
accuracy 0.75 1151
macro avg 0.78 0.70 0.71 1151
weighted avg 0.77 0.75 0.73 1151
Impact of Data Normalization ¶
The Support Vector Machine (SVM) classification algorithm exhibited a noteworthy increase in accuracy upon the application of data normalization. Initially, the SVM model produced an accuracy of 0.75. However, after normalizing the dataset, the accuracy improved dramatically to 0.91. This enhancement underscores the profound impact that feature scaling can have on the performance of SVM.
This happens because Support Vector Machines are fundamentally sensitive to the scale of the input features due to the way they are designed to maximize the margin between different classes. In the absence of normalization, features with larger scales can distort this margin, giving undue weight to certain variables and potentially misguiding the optimization process of the SVM. Normalization brings each feature onto the same scale, making the distance measure more consistent across different dimensions.
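The scale sensitivity described above is easy to see with two toy features on very different scales; a minimal sketch (synthetic numbers, not Spambase values):

```python
import numpy as np

# Two emails described by (word frequency, capital run length):
# the first feature lives in [0, 1], the second in the thousands.
a = np.array([0.10, 1500.0])
b = np.array([0.90, 1520.0])

# Unscaled: the distance is dominated by the large-scale feature,
# even though the word-frequency difference is proportionally huge.
d_raw = np.linalg.norm(a - b)

# After standardizing each feature (assumed per-feature means and
# standard deviations for this toy example), both contribute comparably.
mean = np.array([0.50, 1510.0])
std = np.array([0.25, 10.0])
d_scaled = np.linalg.norm((a - mean) / std - (b - mean) / std)

print(d_raw, d_scaled)
```

Before scaling, the distance is essentially just the capital-run difference (about 20); after scaling, both features contribute to the result.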
scaler = StandardScaler()
# Fit on training set only
scaler.fit(X_train)
# Apply transform to both the training set and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train the SVM classifier on the scaled data, using the linear
# kernel selected by the earlier grid search
svm_model = SVC(kernel='linear')
svm_model.fit(X_train_scaled, y_train)
# Predict on the scaled test data
y_pred = svm_model.predict(X_test_scaled)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
Classification Report:
precision recall f1-score support
0 0.92 0.94 0.93 691
1 0.91 0.87 0.89 460
accuracy 0.91 1151
macro avg 0.91 0.91 0.91 1151
weighted avg 0.91 0.91 0.91 1151
Decision Tree ¶
A decision tree is a machine learning algorithm that partitions the data into subsets based on the value of input features. It is akin to a flowchart where each internal node represents a test on an attribute, each branch corresponds to an outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). This model is popular for its interpretability and ease of use.
In the context of the Spambase dataset, we have opted for a decision tree model due to its effectiveness in handling categorical and continuous data, and its capability to model complex decision boundaries. Decision trees can also inherently perform feature selection, which can be particularly advantageous given the high dimensionality of the Spambase dataset.
As part of our modeling process, we will experiment with different max_depth values, which determine the maximum length of the paths from the root to the leaves. This is crucial for controlling the complexity of the tree and preventing overfitting. Additionally, we will explore various pruning parameters to refine the tree structure, ensuring that it generalizes well to new data. By tuning these parameters, we aim to build an optimized decision tree model that effectively classifies emails as spam or not spam while maintaining interpretability.
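At each internal node, the tree picks the attribute and threshold that most reduce class impurity; a minimal sketch of the Gini criterion used by scikit-learn's default splitter (toy counts, not from Spambase):

```python
def gini(counts):
    """Gini impurity of a node given per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Toy node with 60 non-spam / 40 spam emails, split by some threshold into:
parent = [60, 40]
left = [50, 5]    # mostly non-spam
right = [10, 35]  # mostly spam

# The splitter maximizes the weighted impurity decrease of a candidate split
n = sum(parent)
weighted_child = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
impurity_decrease = gini(parent) - weighted_child
print(round(impurity_decrease, 3))
```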
from sklearn import metrics
from sklearn.metrics import classification_report

def plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title=""):
    # Confusion Matrix
    conf_matrix = metrics.confusion_matrix(y_test, y_test_predict)
    cm_display = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=['Non-Spam', 'Spam'])
# Classification Report
class_report = classification_report(y_test, y_test_predict, output_dict=True)
df_report = pd.DataFrame(class_report).transpose()
# Plotting
fig, ax = plt.subplots(1, 2, figsize=(16, 8))
# Confusion Matrix
cm_display.plot(ax=ax[0])
ax[0].set_title('Confusion Matrix')
# Classification Report Metrics
df_report.iloc[:-3, :-1].plot(kind='bar', ax=ax[1])
ax[1].set_title('Classification Report Metrics')
ax[1].set_xticklabels(['Non-Spam', 'Spam'], rotation=0)
fig.suptitle(title, fontsize=16)
plt.tight_layout()
plt.show()
data_useful_attributes = data[useful_attributes + ['spam']].copy()
X_data = data_useful_attributes.drop(columns=['spam']) # Features
y_data = data_useful_attributes['spam'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, y_train)
y_test_predict = dt.predict(X_test)
plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=3")
print("Classification Report")
print(classification_report(y_test, y_test_predict))
Classification Report
precision recall f1-score support
0 0.86 0.95 0.90 691
1 0.91 0.76 0.83 460
accuracy 0.88 1151
macro avg 0.89 0.86 0.87 1151
weighted avg 0.88 0.88 0.87 1151
import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(dt, out_file=None, feature_names=X_data.columns, class_names=['Non-Spam', 'Spam'], filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph
Grid Search and Cross Validation ¶
A DecisionTreeClassifier was first trained with max_depth set to 3. This depth, the maximum path length from the root to a leaf, limits the complexity of the tree to prevent overfitting. With this setting, the classifier achieved an accuracy of 0.88 on the test set, indicating a high level of prediction capability.
To further refine our model, we employ GridSearchCV, an exhaustive search over specified parameter values for an estimator. The parameters we are tuning are:
- max_depth: [5, 10, 15, 20]. These values represent various levels of tree depth; a deeper tree (higher max_depth) can model more complex patterns but risks overfitting.
- ccp_alpha: [0.0, 0.001, 0.01, 0.1]. Cost-Complexity Pruning (CCP) alpha is the complexity parameter used for Minimal Cost-Complexity Pruning: the subtree with the largest cost complexity that is still smaller than ccp_alpha is chosen, so a higher ccp_alpha prunes more aggressively.
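The ccp_alpha values worth trying can also be read off the data itself: scikit-learn exposes the sequence of effective alphas along the pruning path. A minimal sketch on synthetic stand-in data (not Spambase):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, just to illustrate the pruning path
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

dt = DecisionTreeClassifier(random_state=0)
path = dt.cost_complexity_pruning_path(X, y)

# Effective alphas increase along the path; larger alphas prune more
# aggressively, so total leaf impurity grows as alpha grows.
alphas, impurities = path.ccp_alphas, path.impurities
print(len(alphas), alphas[0], alphas[-1])
```

Each value in `alphas` is a candidate ccp_alpha at which the pruned tree changes, which makes the path a natural source for a grid of pruning parameters.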
The GridSearchCV results are:
- Best Parameters: {'ccp_alpha': 0.001, 'max_depth': 15}. This implies that a tree depth of 15 with slight pruning (CCP alpha of 0.001) yields the best trade-off between model complexity and generalization ability.
- Best Accuracy Score: 0.8987. This score is an improvement over the initial model, underscoring the efficacy of parameter tuning.
- Depth of Best Tree: 11. Interestingly, even though the best max_depth parameter was 15, the actual depth of the best-performing tree turned out to be 11, indicating that the optimal complexity for this dataset is reached before the maximum allowed depth.
param_grid = {
'max_depth': [5, 10, 15, 20], # Different max_depth values to test
'ccp_alpha': [0.0, 0.001, 0.01, 0.1] # Different pruning parameters to test
}
clf = DecisionTreeClassifier(random_state=0)
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
# Note: cross-validation here runs over the full dataset, so the
# held-out test set used below is not fully independent of the tuning
grid_search.fit(X_data, y_data)
print("Best Parameters:", grid_search.best_params_)
print("Best Accuracy Score:", grid_search.best_score_)
print("Depth of Best Tree:", grid_search.best_estimator_.get_depth())
Best Parameters: {'ccp_alpha': 0.001, 'max_depth': 15}
Best Accuracy Score: 0.8987135910871926
Depth of Best Tree: 11
# Train the Decision Tree Classifier
dt = DecisionTreeClassifier(ccp_alpha=0.001, max_depth=15, random_state=0)
dt.fit(X_train, y_train)
y_test_predict = dt.predict(X_test)
plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=15 and ccp_alpha=0.001")
dt = DecisionTreeClassifier(max_depth=3, random_state=0)
dt.fit(X_train, y_train)
y_test_predict = dt.predict(X_test)
plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Decision Tree with max_depth=3")
Random Forest ¶
We also explored the Random Forest algorithm as an alternative approach. A Random Forest is an ensemble learning method that constructs many decision trees during training and outputs the class that is the mode of the individual trees' predictions.
The Random Forest classifier was configured with n_estimators=100 (the default), meaning 100 trees are built in the forest, and max_depth=3, limiting the depth of each tree to prevent overfitting. The ensemble nature of Random Forest typically yields a more accurate model than a single decision tree, since averaging over many trees reduces variance.
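Random Forest's aggregation step can be sketched directly: each tree casts a vote and the majority class wins. A minimal sketch with hypothetical votes (toy numbers, not the actual forest):

```python
import numpy as np

# Hypothetical votes from 5 trees for 4 emails (1 = spam, 0 = non-spam):
# rows = trees, columns = samples.
tree_votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
])

# Majority vote per column: spam iff more than half of the trees say spam
forest_pred = (tree_votes.mean(axis=0) > 0.5).astype(int)
print(forest_pred)  # [1 0 1 0]
```

Because the trees are trained on bootstrapped samples and random feature subsets, their individual errors tend to be weakly correlated, which is why the majority vote is usually more accurate than any single tree.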
However, the mean accuracy over cross-validation for the Random Forest model was 0.9039.
This is only a marginal improvement over the previously tested methods, suggesting that the Decision Tree model, once tuned with an appropriate depth and pruning, was already performing well.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rf_model = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf_model.fit(X_train, y_train)
y_test_predict = rf_model.predict(X_test)
plot_confusion_matrix_and_classification_report(y_test, y_test_predict, title="Random Forest with n_estimators=100 and max_depth=3")
scores = cross_val_score(rf_model, X_data, y_data, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
Cross-Validation Scores: [0.9218241  0.90652174 0.93043478 0.91413043 0.84673913]
Mean Accuracy: 0.9039300382382098
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
precision recall f1-score support
0 0.87 0.98 0.92 691
1 0.96 0.78 0.86 460
accuracy 0.90 1151
macro avg 0.92 0.88 0.89 1151
weighted avg 0.91 0.90 0.90 1151
K-Nearest Neighbors (K-NN) ¶
We also employed the K-Nearest Neighbors (K-NN) algorithm, a method well known for its simplicity and effectiveness in classification tasks. Our approach was to fine-tune the model to determine the number of neighbors (k) that yields the best classification accuracy.
Grid Search for Hyperparameter Tuning ¶
To find the most suitable k value, we implemented a Grid Search strategy, varying k from 1 to 60. This exhaustive search let us systematically sweep a wide range of k values and pinpoint the one that maximizes the accuracy of our K-NN classifier.
Observations: Impact of Increasing k on Accuracy¶
One of the key observations from our analysis was the inverse relationship between the size of k and the accuracy of the model. Specifically, as k increased, there was a noticeable decline in accuracy. This trend can be attributed to the intrinsic workings of the K-NN algorithm. When k is small, the algorithm tends to capture the noise in the data, leading to overfitting. However, as k grows, the classifier starts to consider a broader set of neighbors for each query point. While this can reduce the impact of noise, it also increases the likelihood of including points from other classes within the neighborhood, consequently diluting the decision boundaries and diminishing the classifier's ability to distinguish accurately between classes.
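The effect described above comes straight from the voting rule: a query point is labeled by majority vote among its k nearest neighbors, so a larger k pulls in farther (possibly other-class) points. A minimal sketch with 1-D toy data (not Spambase):

```python
import numpy as np

# Toy 1-D training set: non-spam cluster near 0, spam cluster near 1,
# plus one mislabeled (noisy) spam point inside the non-spam cluster.
X = np.array([0.0, 0.1, 0.2, 0.15, 0.9, 1.0, 1.1])
y = np.array([0,   0,   0,   1,    1,   1,   1  ])

def knn_predict(query, k):
    nearest = np.argsort(np.abs(X - query))[:k]  # indices of k closest points
    votes = y[nearest]
    return int(votes.sum() * 2 > k)  # majority vote (ties -> 0)

# With k=1 the noisy point flips a nearby query (overfitting);
# with k=5 the broader neighborhood outvotes the noise.
print(knn_predict(0.14, k=1), knn_predict(0.14, k=5))  # 1 0
```

Pushing k far beyond the cluster size has the opposite failure mode: the neighborhood starts absorbing points from the other class and the decision boundary blurs.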
Optimal Model: k = 2¶
The Grid Search identified that the model achieved its peak performance with k = 2. This suggests that a tighter, more localized decision boundary is preferable for this particular dataset, as it helps to maintain a balance between reducing noise and preserving the integrity of the class boundaries.
Cross-Validation Results and Comparison with Other Methods¶
Despite identifying an optimal k, the mean accuracy achieved through cross-validation was only 0.6955. This performance is considerably lower when compared to other classification methods applied to the same dataset. For instance, using the Random Forest algorithm, we obtained a mean accuracy of 0.9039. The relatively lower efficiency of the K-NN model in this scenario can be primarily attributed to the characteristics of the Spambase dataset. Given its high dimensionality and potential noise, distance-based methods like K-NN face challenges.
X_train, X_test, y_train, y_test = train_test_split(X_data,y_data, test_size=0.25, random_state=0)
# Convert the pandas objects to contiguous NumPy arrays
# for the KNN neighbor search
X_train = np.ascontiguousarray(X_train)
X_test = np.ascontiguousarray(X_test)
y_train = np.ascontiguousarray(y_train)
y_test = np.ascontiguousarray(y_test)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
# Define a range of 'k' values for K-NN
k_range = list(range(1, 61))
# Create a K-NN classifier
knn = KNeighborsClassifier()
# Create a dictionary of all values we want to test for 'n_neighbors'
param_grid = dict(n_neighbors=k_range)
# Use grid search to test all values for 'n_neighbors'
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
grid_results = grid.cv_results_
# Extract the mean test scores for each parameter
mean_test_scores = grid_results['mean_test_score']
plt.figure(figsize=(12, 6))
plt.plot(k_range, mean_test_scores, color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Accuracy vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy')
plt.grid(True)
plt.show()
best_k = grid.best_params_['n_neighbors']
best_score = grid.best_score_
print("Best K Value:", best_k)
print("Best Score:", best_score)
Best K Value: 2
Best Score: 0.711304347826087
from sklearn.model_selection import cross_val_score

X_data = np.ascontiguousarray(X_data)
y_data = np.ascontiguousarray(y_data)
# Use the best parameter found by the grid search
knn_best = KNeighborsClassifier(n_neighbors=best_k)
scores = cross_val_score(knn_best, X_data, y_data, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", scores)
print("Mean Accuracy:", scores.mean())
Cross-Validation Scores: [0.66340934 0.70108696 0.73478261 0.71630435 0.66195652]
Mean Accuracy: 0.6955079544918095
X_train = np.ascontiguousarray(X_train)
X_test = np.ascontiguousarray(X_test)
y_train = np.ascontiguousarray(y_train)
knn_best.fit(X_train, y_train)
y_test_predict = knn_best.predict(X_test)
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
precision recall f1-score support
0 0.71 0.91 0.79 691
1 0.76 0.44 0.55 460
accuracy 0.72 1151
macro avg 0.73 0.67 0.67 1151
weighted avg 0.73 0.72 0.70 1151
Impact of Data Normalization ¶
A critical aspect of our study involved normalizing the data before applying the K-Nearest Neighbors (K-NN) algorithm. By standardizing the feature set, we observed a substantial improvement in the model's performance: accuracy rose from 0.72 to 0.89. This highlights the importance of normalization in the preprocessing phase, particularly for distance-based algorithms like K-NN.
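Standardization here is the usual z-score transform, fit on the training set and applied unchanged to the test set. A minimal sketch of what StandardScaler computes (toy matrix, not Spambase):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny stand-in training matrix: 2 features on very different scales
X_tr = np.array([[0.1, 100.0],
                 [0.5, 300.0],
                 [0.9, 200.0]])

scaler = StandardScaler().fit(X_tr)  # learns per-feature mean and std
X_scaled = scaler.transform(X_tr)

# Equivalent manual computation: z = (x - mean) / std (population std)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_scaled, manual))  # True
```

After the transform, every feature has mean 0 and unit variance on the training data, so no single feature dominates the Euclidean distances that K-NN relies on.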
scaler = StandardScaler()
# Fit on training set only
scaler.fit(X_train)
# Apply transform to both the training set and the test set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
knn_best.fit(X_train_scaled, y_train)
y_test_predict = knn_best.predict(X_test_scaled)
print("\nClassification Report (Test Set):")
print(classification_report(y_test, y_test_predict))
Classification Report (Test Set):
precision recall f1-score support
0 0.87 0.97 0.91 691
1 0.94 0.78 0.85 460
accuracy 0.89 1151
macro avg 0.90 0.87 0.88 1151
weighted avg 0.90 0.89 0.89 1151
Conclusion ¶
We evaluated a variety of classification algorithms, including Logistic Regression, Logistic Regression with Backward Feature Elimination (BFE), Support Vector Machine (SVM), SVM with normalized data, Decision Trees, Random Forest, K-Nearest Neighbors (K-NN), and K-NN with normalized data. Their performance was compared on three summary metrics: accuracy, macro-averaged F1-score, and weighted-averaged F1-score.
Our findings suggest that, overall, the classification algorithms exhibited similar performance. Notably, Logistic Regression (with and without BFE), Random Forest, and Decision Trees were robust in accuracy and consistent across the metrics without any need for data normalization. The SVM and K-NN algorithms, by contrast, benefited significantly from normalization: both rely on distance calculations, which are profoundly affected by the scale of the features, and their scores improved on all evaluation metrics post-normalization. This underscores the importance of preprocessing steps when working with scale-sensitive algorithms.
In summary, the comparative study has provided valuable insights into the strengths and limitations of each algorithm when applied to the Spambase dataset. The key takeaway is the critical role of data preprocessing and the selection of appropriate algorithms based on the data characteristics and the desired outcome of the model.
# Summary of results (stored under a new name so we do not
# overwrite the `data` DataFrame holding the dataset)
results = {
    'Algorithm': ['Logistic Regression', 'Logistic Regression BFE', 'SVM', 'SVM Normalized', 'Decision Trees', 'Random Forest', 'K-NN', 'K-NN Normalized'],
    'accuracy': [0.90, 0.90, 0.75, 0.91, 0.88, 0.90, 0.72, 0.89],
    'macro avg': [0.90, 0.90, 0.71, 0.91, 0.87, 0.89, 0.67, 0.88],
    'weighted avg': [0.90, 0.90, 0.73, 0.91, 0.87, 0.90, 0.70, 0.89]
}
df = pd.DataFrame(results)
df = pd.melt(df, id_vars="Algorithm", var_name="Metric", value_name="Score")
plt.figure(figsize=(12, 8))
g = sns.barplot(x='Algorithm', y='Score', hue='Metric', data=df)
g.set_yticks(np.arange(0, 1.01, 0.05))
plt.title('Comparative Analysis of Classification Algorithms')
plt.xlabel('Algorithms')
plt.xticks(rotation=45)
plt.show()